The growing problem of unsolicited bulk e-mail, also known as "spam", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in "encrypted" form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures and, at the same time, investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns and is part of a widely used e-mail reader.